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Summary of Research 

Commercially available database systems do not meet the information and processing 
needs of scientist who need to access multiple autonomous databases organized and maintained 
under a variety of DBMSs. With the accessibility of these databases via readily available com- 
puter networks, users will demand the capability to jointly manipulate data in different data- 
bases. The objective of this project is to illustrate the concept of incremental access to distri- 
buted and autonomously managed databases. An experimental database management system, 
ADMS±, which has been developed at the University of Maryland, in College Park, is used as 
the testbed of this research. 

In this research, we investigated a number of design and performance issues of interoper- 

4 ^ 

able database management systems. The major results of our investigation were obtained in the 
areas of Client-Server Database architectures for heterogeneous DBMSs, incremental computa- 
tion models, buffer management techniques, and query optimization. We finished a prototype of 
an advanced Client-Server workstation- based DBMS which allows access to multiple hetero- 
geneous commercial DBMSs. Experiments and simulations were then run' to compare its perfor- 
mance with the standard client-server architectures. 

The focus of this research was on adaptive optimization methods of heterogeneous database 
systems. Adaptive buffer management accounts for the random and object oriented access 
methods for which no known characterization of the access patterns exist. Adaptive query 
optimization means that value distributions and selectivities, which play the most significant role 
in query plan evaluation, are continuously refined to reflect the actual values as opposed to static 
one that are computed offline. 
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Query feedback is a concept that was first introduced to the literature by our group. We 
employed query feedback for both adaptive buffer management and for computing value distri- 
butions and selectivities. For adaptive buffer management, we use the page faults of prior exe- 
cutions to achieve more "informed” management decisions. For the estimation of the distribu- 
tions of the selectivities, we use curve-fitting techniques, such as "least squares" and "splines", 
for regressing on these values. 

1. ViewCache: An Incremental Access Method for Database Access 

Query processing in relational databases can be sped up by maintaining views as stored sets 
of pointers, called ViewCaches , pointing to those tuples in the underlying relations needed to 
materialize the views. A ViewCache is a stored collection of pointers pointing to records of 
underlying relations needed to materialize a view [Roussopoulos 1991]. In this paper, an Incre- 
mental Access Method (1AM) that amortizes the maintenance cost of ViewCaches over a long 
time period or indefinitely is developed and studied analytically and experimentally. Amortiza- 
tion is based on deferred and other update propagation strategies. A deferred update strategy 
allows a ViewCache to remain outdated until a query needs to selectively or exhaustively 
materialize the view. At that point, an incremental update of the ViewCache is performed. This 
paper defines a set of conditions under which incremental access to the ViewCache is cost- 
effective. The decision criteria are based on some dynamically maintained cost parameters 
which provide accurate information but require inexpensive bookkeeping. 

The IAM capitalizes on the ViewCache storage organization for performing the update and 
the materialization of the ViewCaches in an interleaved mode using one-pass algorithms. Com- 
pared to the standard technique for supporting views that requires reexecution of the definition of 
the view, the IAM offers significant performance advantages. We will show that under 
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favorable conditions, most of which depend on the size of the incremental update logs between 
consecutive accesses of the views, the incremental access method outperforms query 
modification. Performance gains are higher for multilevel ViewCaches because all the I/O and 
CPU for handling intermediate results is avoided. 

The IAM allows for a number of update strategies to be implemented: 

a) Immediate update broadcast 

All affected ViewCaches are immediately updated. This strategy maintains minimal 
response time at the expense of system overhead. For some applications and some views, 
this may be desirable for instant response time on view retrieval. 

b) Periodic update broadcast 

All affected ViewCaches are periodically updated. This strategy can also be used to do 
updates during periods of low system load or at prespecified times. 

c) Event triggered update 

A ViewCache is updated when an event occurs. For example, a particular event can be 
triggered when the thresholds for the IAM cost-effectiveness on the incremental update 
logs have been reached (selective update propagation). 

d) Deferred update 

All ViewCaches stay outdated until a direct or indirect reference to them is made; at that 
time, the IAM for a ViewCache is invoked if it is cost-effective. This avoids any system 
overhead associated with immediate propagation of updates. 

Strategies (b)-(d) implement the deferred (lazy) approach, that postpones the incremental 
update propagation either for a while (strategies (b) and (c)) or indefinitely (strategy (d)) unless a 
direct or indirect reference to a ViewCache is made. 
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The algorithms and the results obtained for the IAM are applicable to all strategies (a)-(d). 
Therefore, the strategies can be combined, allowing the user to specify different strategies for 
views depending on their importance, frequency of access, and other application dependent cri- 
teria. The main difference among these strategies is how they control the size of the incremental 
update logs. The trade-off is between system overhead and query processing time. The deferred 
strategies incur less or no overhead on the system at the cost of having to process bigger incre- 
mental update logs during the query processing. 

In addition to the theoretical (worst case) analysis presented in [Roussopoulos 1991], 
experiments were run on our prototype runing on a Sun 3/280. These compared the reexecution 
indexed loop join algorithm with the corresponding incremental one. Both algorithms use a sin- 
gle B-tree as a secondary index. The experiments consisted of mixed complexity queries run 
several times against the database. Complete analysis can be found in [Roussopoulos 1991] for 
a variety of query/update loads drawn from the Wisconsin benchmark. 

In [Roussopoulos et al 1993] we presented three more applications of the Incremental 
Access Methods including a Prolog record-at-a-time access method facilitated by VIEWCACHE 
and its IAM and an architecture for a multi-site federation of database client-servers. In the 
latter architecture, each client workstation not only submits database requests to the servers, but 
in addition, it runs a DBMS caching into its local disk data transmitted during query processing. 
The Client-Server architecture paradigm is the foundation for distributed and cooperative pro- 
cessing in several application areas. Such a paradigm is of paramount importance as the size and 
complexity of databases increase. Downloaded data is then on accessed locally, thus, saving 
data transmission over the network. Updates are done on the servers which act as the primary 
sites of the data stored in them. 
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Downloaded cached data is maintained incrementally, by transmitting, on demand, from 

i 

the appropriate servers the updated records of the relevant data between two consecutive 
requests of each workstation. The main advantage of this architecture is that it distributes query 
processing and disk accessing on multiple workstations, and this save to it. The performance is 
tremendously increased because all cached data is accessed in parallel. Thus, this enhanced 
client-server architecture enjoys features of database machines with multiple disks. We have 
developed simulation packages for measuring the impact of this architecture on query 
throughput. The simulations show that, depending on the frequency of updates, speed-ups of up 
to linear in the number of participating workstations. It also shows that the architecture scales 
up very in a very satisfactory manner. 

The incremental access methods described above provide the foundation for client-server 
database architectures described in the next sections. 

1.1. Client-Server Database Architectures 

The use of multiprocessor systems in databases and database machines have been 
thoroughly investigated over the past decade. These efforts tried to predominantly optimize the 
performance of large join operations by utilizing multiple disks and processors. Aside from their 
limited commercial success, under certain conditions, they managed to achieve some of their 
objectives. Their major limitation though was their excessive cost. The main point of this 
investigation was to show that the demand for higher system throughput in Database Manage- 
ment Systems (DBMSs) can be achieved by combining off-the-self systems running on multiple 
but single-CPU hardware. Such generic hardware solutions are less costly and a lot more flexi- 
ble. 
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Until recently, high throughput database processing was undertaken by a large mainframe 
that was the dedicated machine for all the data processing. During the last decade we have 
experienced a number of developments that are changing the way DBMSs are build and operate. 
First, we have seen the spectacular introduction and wide use of high-end workstations with 
very low prices. Second, disk units become larger and more reliable and finally, computer net- 
work technology has matured and offers reliable operations for file transfers, remote access and 
message handling. In this paper, we are concerned with the performance of modem Client- 
Server (CS) database architectures. 

All CS architectures we have studied consist of a number of workstations (clients), one or 
more large workstation(s) or mainframe(s) which undertake the role of the server(s), and a local 
area network connecting them all together. We assume that multiple databases running on dif- 
ferent servers are autonomous and that no inter-database transactions exist (which may cause 
inter-database blocking and additional delays). The client functionality ranges from just running 
the application with no caching on either main memory or disk and minimal or no decoupling 
between the client and the server, all the way to having full cache management capability on the 
client and high degree of decoupling and data distribution. Each server can have either a single 
or an array of disks for parallel I/O. 

1.2. Evaluation of Client-Server Database Architectures 

Several CS database architectures can be built with the above functionality. This investiga- 
tion concentrated on the following: (a) no caching on the clients and single disk on each server. 
This is the Standard and minimal functionality configuration (SCS). (b) No caching on the 
clients but multiple disks on each server (CS-MD) for data replication and parallel I/O. (c) 
Enhanced disk cache management functionality on the clients for dynamic data migration and 
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incremental maintenance of cached data, and a single disk on the server (ECS). Downloaded 
and cached data provides the replication needed for parallel I/O. (d) As in (c) with the addition 
of a special purpose buffer manager dedicated for facilitating the access of the server logs for the 
incremental cache management (ECS-LB). 

We chose for evaluation the above (b-d) configurations as they are most comparable with 
regard to hardware (equal number of disks and roughly equivalent replication overhead). The 
SCS is used as the basis for our metric. 

There are a number of studies that deal with similar issues to those we are discussing here. 
Hagman and Ferrari are among the first who tried to split the functionality of a database system 
and off-load parts of it to dedicated back-end machines. Among other results, they found that 
generally there is a 60 overhead in disk I/O and a suspected overhead of similar size for CPU 
cycles. Roussopoulos and Kang proposed the coupling of a number of workstation-based 
DBMSs loosely-coupled with a mainframe DBMS. The cooperation included incremental and 
lazy maintenance of dowloaded and cached data on the client databases. 

We started our investigation in the area of database system architectures for heterogeneous 
information systems, the overhead of heterogeneous computing and the use of buffering and 
incremental computation models for alleviating the overhead. We then extended the Client- 
Server architectures to take advantage of the disks available on the clients to provide parallel 
access to replicated data. 

In order to evaluate the benefits of such architectures, we have designed and implemented 
simulation packages for comparing several a tightly coupled architectures such as ordinary 
client-server, and loosely coupled ones such as the Enhanced Client-Server (ECS) and com- 
pletely loose multidatabase architectures. Three such architectures are being studied and our 
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first results are reported in [Roussopoulos & Delis 1991] and [Delis & Roussopoulos 1992], 

The ECS architecture is based on incremental computation models [Roussopoulos 1991]. 
The results of queries originating from a workstation are materialized and incrementally main- 
tained on the workstation. The architecture provides distributed functionality while ensuring 
high availability and low communication overhead. Incremental computation models combined 
with advanced indexing and storage strategies are used to achieve high performance [Rousso- 
poulos et al 1993]. Gateways to multiple relational databases and to a network database are 
currently under development. Our implementation of the ADMS± prototype at the moment pro- 
vides incremental access to Oracle, Ingres and ADMS. The prototype is used to reassure us 
about the results of the simulations. 

The simulation was extended to measure the performance of all client-server architectures 
under variant sizes of main memory for buffering. The buffering clearly had an effect on the 
execution but did not change any of the trends observed in our non-buffering simulations. 

We then extended our investigation in the area of database system architectures for hetero- 
geneous and distributed information systems, the use of buffering and incremental computation 
models for alleviating the overhead of distributing computing [Delis & Roussopoulos 1992] and 
the benefits obtained by the parallel disk access observed in the ECS architecture. We were par- 
ticularly interested in studying scalability of I/O systems and developed a prototype for measur- 
ing context switching and network latency overhead. 

One of the biggest concerns in distributed database architectures is the management and the 
update control of dispersed and replicated data. The severe overhead in maintaining consistency 
resulted in the limited use of distributed systems or to systems that allow very restricted mode of 
updates. In the Enhanced Client-Server architecture, controlling the downloaded data is very 
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important because this determines the performance. We apply incremental update algorithms to 
bring into sync the client data. However, if the client is disconnected for a long period, incre- 
mental may not be the most efficient way and retransmission of the whole data subset be more 
appropriate. However, to avoid this situation, a background process running in the client can 
trigger the incremental refresh. Another technique that can be used, is periodic propagation of 
updates to the clients. These techniques and alternative ones are examined in [Delis & Rousso- 
poulos 1994], 

2. Adaptive Database Systems 

In the context of Adaptive DBMSs, we have concentrated on buffer management tech- 
niques, the design of dynamic query optimization techniques, and the concept of query feedback 
for adapting buffer allocation strategies. 

2.1. Adaptive Buffer Management 

Database buffers are used in most systems in order to avoid excessive disk IOs. Previous 
research on database buffer management primarily based on the classifications of page reference 
patterns exhibited by simple relational operations. However, fine classification and pattern pred- 
iction are difficult for complicated queries or in advanced data models. 

Previous works on buffer allocation are based either exclusively on the availability of 
buffers at runtime or on the access patterns of queries. Our group introduced in [Ng et al 1991] a 
unified approach for buffer allocation in which both of these considerations are taken into 
account. Our approach is based on the notion of marginal gains which specify the expected 
reduction on page faults in allocating extra buffers to a query. In the first half of this paper, we 
present mathematical models and formulas computing marginal gains for various access patterns 
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exhibited by relational database queries. The paper proposes a class of buffer allocation algo- 
rithms based on marginal gains. Finally, the paper presents simulation results that show that our 
approach is promising and our algorithms achieve significant improvements on throughput, as 
compared with previous methods. 

Although in the above paper we proposed the marginal gains-based policy that takes both 
run-time and access patterns of queries into account, and allows flexible (sub-optimal) buffer 
allocations to queries, this method, as well as all previously proposed, is static, that is it uses 
pre-calculated parameters to make the buffer allocation decisions. In [Faloutsos et al 1991], we 
extend that idea in order to design an adaptable buffer allocation algorithm, that automatically 
optimizes itself for changing query workloads. The basic concept used is that of predicting the 
effect an allocation may have to the throughput of the system, and making the allocation that 
maximizes the expected throughput. 

However, all these methods are heavily dependent on the assumed pattern of access and 
any deviation from these can cause thrashing. To remedy this, we proposed an efficient scheme 
based on query feedback to achieve better buffer utilization. We devise a feedback mechanism, 
Faulting Characteristic Model (FCM), [Chen 8c Roussopoulos 1993], to associate each executed 
query with a few records that characterize its page access and faulting history. Based on FCM, 
an allocation scheme, termed Marginal Gain Ratio (MGR), is employed to adjust the allocations 
for recurring queries. Another scheme, called Never-Used-Again (NUA), is used to optimize the 
buffer replacements. Experiment results show that, MGR and NUA outperform the 
classification-based methods by 15-20% and 20-70% in average system throughput, respectively. 
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2.2. Estimation of Selectivities Using the Query Feedback 

We also developed a novel approach for estimating the record selectivities of database 
queries. The actual attribute value distribution is approximated by a curve-fitting function using 
a query feedback. This approach has the advantages of incurring no additional overhead for 
database statistics collection, and adapting to the evolving data value distributions during 
updates. 

In most database systems, query optimization chooses efficient execution plans for database 
queries. In order to recognize the best plan, the optimizer must make accurate estimates of the 
costs of alternative plans. One of the most important factors that affect the plan cost is data 
selectivity, which is the number of records satisfying a given predicate. Accurate selectivity 
estimation attracted a great deal of research which resulted in a variety of methods. In sum- 
mary, all these methods rely on the maintenance of the attribute value distributions obtained by 
having to do extra access to the database. In this research we are exploring a different approach 
for approximating the attribute distribution using query feedback. The idea is to use the size of 
the query results as feedback to re-shape (regress) the attribute distribution. 

This approach has the following characteristics: 

(1) No additional overhead: Since query feedback is obtained as a bonus of query execution, it 
eliminates additional dictionary accesses for maintaining statistics. This benefits even more 
systems in which database access is expensive (e.g. very large/tertiary, or mobile data- 
bases). 

(2) Statistics Accuracy: Assumptions or estimates of certain statistics are always used to assist 
in planning an efficient query evaluation. Very often both the assumptions are wrong and 
statistics are inaccurate. . Query feedback provides high accuracy based on real experience, 
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and removes the need of having to do simplistic assumptions on value distributions such as 
uniformity or questionable rules of thumb. 

(3) System Independence: Since query feedback, by itself, involves only the results rather than 
the internals of the query evaluations, it can be best customized for use in heterogeneous 
database systems. We are developing techniques which will introduce sample quenes to 
run for the purpose of accurately estimating the cost of accessing heterogeneous DBMSs. 

(4) Efficiency: Unlike the previous methods, no off-line database scan and statistical data gath- 
ering is needed to form the distribution. Such overhead overhead is proportional to the 
database size. In contrary, the overhead in the proposed method is constant time for each 
query, regardless of the database size. 

This research was reported in [Chen & Roussopoulos 1994]. 

In a different direction, we developed a technique on dynamic or parametric queiy optimi- 
zation. In existing query optimizers, the values of many important run-time parameters of the 
system, the data, or the query are unknown at query optimization time. Dynamic, or parametric, 
query optimization attempts to identify several execution plans, each one of which is optimal for 
a subset of all possible values of the run-time parameters. In [Ioannidis et al 1992] we present a 
general formulation of this problem and study it primarily for the buffer size parameter. We 
have adopted randomized algorithms as the main approach to this style of optimization and 
enhance them with a sideways information passing feature that increases their effectiveness in 
the new task. Experimental results of these enhanced algorithms show that they optimize 
queries for large numbers of buffer sizes in the same time needed by their conventional versions 
for a single buffer size, without much sacrifice in the output quality. 


August 31, 1994 


Page 13 


Final Report 

Contract Number USRA 5555-09 


3. Conclusions 

The major results of this project are in the areas of Incremental Access Methods, Client- 
Server Database architectures for heterogeneous DBMSs, adaptive buffer management tech- 
niques, and adaptive query optimization. In this final report a short description of these were 
outlined while the full details of these were published in the open literature. 


During this three year contract, is period, we finished a prototype of an advanced Client- 
Server workstation-based DBMS which allows access to multiple heterogeneous commerc.al 
DBMSs. Experiments and simulations were then run to compare its performance with the stan- 

dard client-server architectures. 


Two new methods were discovered during this contract. The firs, is the notion of marginal 
gains to direct buffer allocation during query execution. The other was the concept of query 
feedback to make follow up quety processing more efficient. This "learning from experience” 


idea was discovered somewhere in the middle of the contract and applied to several problems of 
query optimization. We also continued the fine-tuning of the incremental access methods in the 
client-server database architectures area along with a very extensive simulation on its scalability. 
„ is worth pointing out, that a commercial DBMS vendor, Sybase, is now promoting one of the 
client-server architectures we proposed and doing incremental update processing as our ECS 


configuration. 

The project has been very successful. I, generated a total of fourteen papers. Six of them 
appeared in the top database journals (ACM Transactions of Database Systems, Transactions of 
EEE on Knowledge and Data Engineering, Transactions on Software Engineering, and ACM 
Sigmod Record). Eight papers were presented a. and published by the top international database 
conferences. Four of them appeared a, the International Conference on Very Large Databases in 
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1991, 1992 and 1993. Two other papers appeared in the 1991 and 1994 ACM SIGMOD confer 
ences. One in the 1994 International conference on Extending Database Technology, and one a, 
the 1994 IEEE 14th International Conference on Distributed Computing Systems. The last paper 

received the Best Paper Award. 

Nine students were panially supported by -his contract. Three of them graduated with a 
PhD and three with a MSc. The other three are in various stages of their PhD program. 

4 . Recommendations 

The research on advanced client-server database architectures is just finding its way to 
commercial applications. I. will be beneficial to verify the results obtained by our simulations 
with real data from one of the many installations a. NASA. This will test the scalability on a 
real network "in-action" as opposed to a simulation on a University local area network. 

The research on adaptive database systems jus. put the foundation for this area. It has a lot 
of potential and can be exploited in many other systems areas where performance is critical. I. 
has already generated a lot interest both to tire University research community (follow up 
research is forthcoming from University of Wisconsin and University of Waterloo) and software 
industry (we go, requests from Tandem, Sybase, and Oracle). This research should be continued 
not only for the database area but should be exported to the systems and architec 
Several perfotmance issues caused by overhead of interoperable distributed systems and 
network-based cooperative software will benefit from the concepts developed in this work. 
NASA should foster research in this area which will help in the replacement of todays monol- 
ithic software ,0 a flexible component-based distributed and cooperative software of the future. 
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